fix: RegistryTEEConnection.reconnect() silently swallows _connect() failures #218

Open
amathxbt wants to merge 2 commits into OpenGradient:main from amathxbt:fix/tee-reconnect-silent-failure

@amathxbt
Contributor

🐛 Critical Bug: Silent TEE Failover Failure

File: src/opengradient/client/tee_connection.py → RegistryTEEConnection.reconnect()

What's broken

The try/except Exception block in reconnect() wraps the entire body, including self._active = self._connect(). If _connect() raises (registry down, no active TEE, TLS error), the exception is swallowed and logged only at DEBUG level, and self._active is never updated, so the caller keeps using the stale connection.

The except was intended to guard old-client cleanup (matching StaticTEEConnection), not the reconnect itself.

Before (buggy):

async with self._refresh_lock:
    try:
        self._active = self._connect()  # ← failure here is silently eaten
    except Exception:
        logger.debug("Failed to close previous HTTP client...")  # wrong scope

After (fixed):

async with self._refresh_lock:
    old_client = self._active.http_client
    self._active = self._connect()  # ← now propagates on failure
    try:
        await old_client.aclose()  # ← only cleanup is guarded
    except Exception:
        logger.debug("Failed to close previous HTTP client...")
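A minimal, self-contained sketch of the difference between the two scopes. Everything here is a hypothetical stand-in (FakeHTTPClient, ActiveTEE, SketchConnection, the registry_up flag); it is not the real RegistryTEEConnection, only a toy with the same reconnect() shape:

```python
import asyncio
import logging
from dataclasses import dataclass

logger = logging.getLogger("tee_sketch")


class FakeHTTPClient:
    """Hypothetical stand-in for the real HTTP client."""

    def __init__(self):
        self.closed = False

    async def aclose(self):
        self.closed = True


@dataclass
class ActiveTEE:
    """Hypothetical stand-in for the active-connection record."""
    endpoint: str
    http_client: FakeHTTPClient


class SketchConnection:
    """Toy model of a registry-backed connection; not the real class."""

    def __init__(self):
        self._refresh_lock = asyncio.Lock()
        self._active = ActiveTEE("tee-old", FakeHTTPClient())
        self.registry_up = True  # assumption: toggles a simulated outage

    def _connect(self):
        if not self.registry_up:
            raise ConnectionError("registry unreachable")
        return ActiveTEE("tee-new", FakeHTTPClient())

    async def reconnect_buggy(self):
        async with self._refresh_lock:
            try:
                self._active = self._connect()  # failure is swallowed here
            except Exception:
                logger.debug("Failed to close previous HTTP client")

    async def reconnect_fixed(self):
        async with self._refresh_lock:
            old_client = self._active.http_client
            self._active = self._connect()  # failure propagates
            try:
                await old_client.aclose()  # only cleanup is guarded
            except Exception:
                logger.debug("Failed to close previous HTTP client")


async def demo():
    conn = SketchConnection()
    conn.registry_up = False  # simulate the registry being down

    await conn.reconnect_buggy()  # returns normally despite the outage
    buggy_endpoint = conn._active.endpoint

    try:
        await conn.reconnect_fixed()
        fixed_raised = False
    except ConnectionError:
        fixed_raised = True
    return buggy_endpoint, fixed_raised, conn._active.endpoint
```

Running asyncio.run(demo()) in this toy shows the buggy variant returning normally with the old endpoint still active, while the fixed variant raises ConnectionError and leaves self._active untouched.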

Impact

| Before | After |
| --- | --- |
| TEE failover silently does nothing when registry/TEE is unreachable | Exception propagates; callers surface a clear error |
| self._active stays stale, pointing to a dead connection | self._active only updated on success |
| Root cause hidden under confusing downstream errors | Fail fast with actionable message |

Verification

Matches the pattern already used correctly in StaticTEEConnection.reconnect() (L78–L85).
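The other half of that pattern, tolerating a failed close of the old client, could be checked with a sketch along these lines. FailingClient, Active and Conn are hypothetical test doubles, and the injected connect callable is an assumption made for testability, not part of the real class:

```python
import asyncio
import logging

logger = logging.getLogger("tee_sketch")


class FailingClient:
    """Hypothetical client whose aclose() always fails."""

    async def aclose(self):
        raise RuntimeError("socket already gone")


class Active:
    """Hypothetical active-connection record."""

    def __init__(self, name, http_client):
        self.name = name
        self.http_client = http_client


class Conn:
    """Toy connection using the fixed reconnect() shape from this PR."""

    def __init__(self, connect):
        self._refresh_lock = asyncio.Lock()
        self._connect = connect  # injected so a test controls the outcome
        self._active = Active("old", FailingClient())

    async def reconnect(self):
        async with self._refresh_lock:
            old_client = self._active.http_client
            self._active = self._connect()
            try:
                await old_client.aclose()
            except Exception:
                # Cleanup failure is logged and ignored; the new
                # connection is already in place at this point.
                logger.debug("Failed to close previous HTTP client", exc_info=True)


async def check_cleanup_failure_is_tolerated():
    conn = Conn(lambda: Active("new", FailingClient()))
    await conn.reconnect()  # aclose() raises, but reconnect() still succeeds
    return conn._active.name
```

In this toy, a failing aclose() does not prevent the switch to the new connection, which is the behavior the narrowed try/except is meant to preserve.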

@amathxbt
Contributor Author

amathxbt commented Mar 31, 2026

@adambalogh could you take a quick look? This is a critical issue.
Thanks 🙏
